Abstract

This paper describes the one-speaker detection systems submitted by
AFRL/HEC for several of the training and testing conditions in the 2005
NIST Speaker Recognition Evaluation. For each condition, the overall
system score was the weighted combination of scores from several
component systems. The component systems were based on (1) mel-frequency
cepstral coefficients (MFCCs) and Gaussian mixture models (GMMs);
(2) MFCCs and phoneme-specific GMMs (PS-GMMs); (3) linear-prediction-based
cepstral coefficients (LPCCs) from closed-phase analysis; (4) formant
center frequencies, formant bandwidths, and fundamental frequency
(FMBWF0); and (5) word language modeling (WLM). The score combination
was done using single-layer perceptrons, with the grouping of the
component systems depending on the lengths of the training and testing
files. For some of the testing and/or training conditions involving
ten-second speech files, the system performance improved from the
inclusion of the FMBWF0 and LPCC systems, while the MFCC/PS-GMM system
provided additional benefits in the one-conversation testing conditions
involving larger amounts of training data.

1.  Introduction 

This paper describes the speaker recognition systems submitted by
AFRL/HEC for the (four-wire) one-speaker detection conditions in the
2005 Speaker Recognition Evaluation (SRE) sponsored by the National
Institute of Standards and Technology (NIST) [1].1 One of the recent
trends in speaker recognition is the fusion or combination of the output
scores from several systems, such as in [3], to provide an overall score,
and our system was similar in this regard. For each condition, the
overall system score was the weighted combination of scores from several
component systems. The component systems were based on (1) mel-frequency
cepstral coefficients (MFCCs) and Gaussian mixture models (GMMs);
(2) MFCCs and phoneme-specific GMMs (PS-GMMs); (3) linear-prediction-based
cepstral coefficients (LPCCs) from closed-phase analysis and GMMs;
(4) formant center frequencies, formant bandwidths, and fundamental
frequency with GMMs (denoted here by FMBWF0); and (5) language modeling
on the words from speech recognition transcripts (denoted here by WLM).
For testing or training conditions involving short speech files, the
scores from the MFCC, FMBWF0, and LPCC systems were combined using a
single-layer perceptron (SLP). For testing and training conditions
involving larger amounts of speech data, the score combination was done
in two stages. First, the scores from fifteen PS-GMM systems were
combined using an SLP. Then, the output score from the SLP was combined
with the scores from the MFCC, FMBWF0, LPCC, and WLM systems to yield
the final score.

Opinions, interpretations, and conclusions are those of the authors
and are not necessarily endorsed by the United States Air Force.

1 The AFRL/HEC system submitted for the conditions requiring
speaker segmentation and clustering is described in [2].

We show that, compared to the baseline MFCC/GMM system, the inclusion
of the FMBWF0 and LPCC systems improved the performance for some of the
training and/or testing conditions involving ten-second speech files,
while the inclusion of the MFCC/PS-GMM system improved the performance
for the training and testing conditions involving larger amounts of data.

An outline of the paper is as follows. The next section briefly
describes the 2005 evaluation conditions considered in this paper.
Section 3 describes the component systems as well as the speech activity
detector (SAD) used with some of the GMM-based systems, while Section 4
describes the development of the score combination system. Section 5
presents the evaluation performance results, and Section 6 presents the
results of some post-evaluation experiments aimed at improving the use
of the PS-GMM system. Finally, Section 7 presents the conclusions.

2.  The  NIST  2005  Speaker  Recognition 
Evaluation 

The NIST 2005 SRE consisted of 20 distinct tasks [1]. Here, we consider
the eight standard one-speaker detection tasks, consisting of four
training conditions by two testing conditions. The training conditions
all involved four-wire (two-channel) conversations and were defined by
the following amounts of data: (1) an excerpt estimated to contain
approximately 10 seconds of speech of the target on its designated side
(designated as 10sec4w), (2) one conversation of approximately five
minutes total duration with the target speaker (designated as 1conv4w),
(3) three conversations involving the target speaker (designated as
3conv4w), and (4) eight conversations involving the target speaker
(designated as 8conv4w). The testing conditions involved either
(1) 10 seconds of speech (designated as 10sec4w) or (2) one five-minute
conversation (designated as 1conv4w), as in the 10sec4w and 1conv4w
training conditions, respectively.

In addition to the speech files, NIST provided transcripts produced by
an English-language speech recognition system from BBN with word error
rates typically in the range of 15-30% for English conversational
telephone speech. English-language transcripts were provided for all
files, despite the fact that some of the files contained speech in other
languages, namely Arabic, Mandarin, Russian, and Spanish.

NIST compares system performance in two major ways. First, NIST uses a
detection cost function, $C_D$, defined as a weighted sum of miss and
false alarm probabilities:

$$C_D = C_M P_{M|T} P_T + C_{FA} P_{FA|NT} (1 - P_T),$$

where $C_M$ is the cost of a miss (chosen by NIST as 10), $C_{FA}$ is
the cost of a false alarm (chosen by NIST as 1), $P_T$ is the a priori
probability of a target (chosen by NIST as 0.01), $P_{M|T}$ is the
probability of a miss given a target trial, and $P_{FA|NT}$ is the
probability of a false alarm given a non-target trial. $P_{M|T}$ and
$P_{FA|NT}$ are functions of system performance and the chosen detection
threshold. For a given system, chosen costs, and a priori target
probability, there is a threshold that yields a minimum value of $C_D$;
we refer to this minimum value of $C_D$ as the minDCF value. Second,
NIST uses plots of $P_{M|T}$ versus $P_{FA|NT}$, called Detection Error
Trade-off (DET) plots [4], to show how system performance varies for a
wide range of operating points. In addition to these two presentations
of performance, we will also use the equal error rate (EER), the value
of $P_{M|T}$ (or $P_{FA|NT}$) when $P_{M|T} = P_{FA|NT}$.
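The two scalar metrics can be computed directly from lists of target and
non-target trial scores. The following Python sketch is illustrative only
and is not the NIST scoring software. It assumes that a trial is accepted
when its score is at or above the threshold, sweeps the threshold over the
observed scores, and approximates the EER by the operating point at which
the two error rates are closest. The function names are hypothetical.

```python
import numpy as np


def det_points(target_scores, nontarget_scores):
    """Return (P_miss, P_fa) arrays over all candidate thresholds."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    non = np.sort(np.asarray(nontarget_scores, dtype=float))
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = np.concatenate([np.unique(np.concatenate([tgt, non])), [np.inf]])
    # Assumption: a trial is accepted when score >= threshold.
    p_miss = np.searchsorted(tgt, thresholds, side="left") / tgt.size
    p_fa = 1.0 - np.searchsorted(non, thresholds, side="left") / non.size
    return p_miss, p_fa


def min_dcf(target_scores, nontarget_scores, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of C_D over thresholds, with the NIST 2005 costs as defaults."""
    p_miss, p_fa = det_points(target_scores, nontarget_scores)
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return float(cost.min())


def eer(target_scores, nontarget_scores):
    """Approximate EER: the point where P_miss and P_fa are closest."""
    p_miss, p_fa = det_points(target_scores, nontarget_scores)
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return float((p_miss[i] + p_fa[i]) / 2.0)
```

With the small prior $P_T = 0.01$, the false alarm term dominates $C_D$,
so the minDCF operating point typically lies in the low false alarm
region of the DET plot.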

3.  Component  Systems 

The overall system consisted of various combinations of the scores from
five component systems, depending on the length of the training and
testing files. Four of the component systems were based on GMMs, while
the WLM system involved language modeling. The next subsection discusses
the GMM-based systems, while Subsection 3.2 discusses the WLM system.

3.1.  GMM-Based  Systems 

This section discusses the various GMM-based systems. Common aspects of
the systems are presented in the next subsection, while the unique
aspects of each feature set are discussed in their respective
subsections.

3.1.1.  Overview  of  GMM-Based  Systems 

The GMM-based systems, regardless of feature set, all used Version 2.1
of the MIT Lincoln Laboratory (MIT-LL) MFCC/GMM system [5] with 2048
mixtures per model and diagonal covariance matrices for each mixture.

All of the GMM-based systems used a common speech activity detector
(SAD),2 which worked in three stages. The first stage utilized a
two-state speech/non-speech Hidden Markov Model (HMM) with MFCCs as the
features. The second stage refined the HMM output by applying an
energy-based detector. The final stage post-processed the output by
reclassifying as non-speech any segments labeled as speech that were
less than 20 msec in duration. The MFCC/HMM portion of the SAD was built
using HTK from Cambridge University3 using 64 mixtures per state. The
energy-based detection was performed using the MIT-LL xtalk program
from their MFCC/GMM speaker recognition system.
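The final post-processing stage can be sketched as a pass over per-frame
speech labels. This is a minimal illustration rather than the submitted
implementation: it assumes that the SAD output is a sequence of boolean
frame labels and that the caller converts the 20 msec limit into a frame
count (for example, 2 frames if the labels are spaced 10 msec apart, which
is an assumption). The function name is hypothetical.

```python
def drop_short_speech(labels, min_frames):
    """Relabel speech runs shorter than min_frames as non-speech.

    labels: sequence of booleans, True meaning the frame is speech.
    min_frames: shortest speech run (in frames) that is kept.
    """
    out = list(labels)
    i = 0
    while i < len(out):
        if out[i]:
            # Find the end of the current run of speech frames.
            j = i
            while j < len(out) and out[j]:
                j += 1
            if j - i < min_frames:
                out[i:j] = [False] * (j - i)
            i = j
        else:
            i += 1
    return out
```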

The cepstral-coefficient-based systems (i.e., all of the GMM-based
systems except the FMBWF0 system) shared a number of additional
similarities. Each set of cepstral coefficients had RASTA filtering [6]
applied and included the deltas of the features. After the RASTA
filtering of the cepstral features and the addition of the deltas,
feature mapping [7] was also used; however, the channel was always
chosen using the channel determined by the MFCCs. Finally, the mapped
features were normalized to have zero mean and unit variance.

Gender-dependent T-norm [8] was applied (using 120 models for each
gender), with the exception that gender-independent T-norm (with 240
models) was used in the 10sec4w training conditions. For the
10sec4w-10sec4w training/testing condition, T-norm models were built
from 30 seconds of data. For the other training conditions, T-norm
models were built using approximately two minutes of data.
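T-norm normalizes the raw score of a test file against a claimed model
using the scores that the same test file obtains against a cohort of
T-norm (impostor) models. A minimal sketch follows; the function name is
hypothetical, and the use of the population standard deviation is an
assumption. For gender-dependent T-norm, the cohort scores passed in would
come only from the T-norm models of the relevant gender.

```python
import numpy as np


def t_norm(raw_score, cohort_scores):
    """Normalize raw_score by the mean and standard deviation of the
    scores of the same test file against the T-norm cohort models."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return float((raw_score - cohort.mean()) / cohort.std())
```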

The background model data consisted of approximately 16 hours of speech
from a variety of sources, including the NIST 2001-2003 evaluations (for
carbon button land line data, electret microphone land line data, and
digital cellular data) and the OGI National Cellular Database4 (for
analog cellular data). The background model data were balanced for
gender and the four previously mentioned channel types, and these
channels were the ones used in the feature mapping. The T-norm model
data came from NIST 2001-2003 evaluation data.


2 The MFCC/PS-GMM system only used this SAD if the SAD from
the SONIC speech recognizer failed to find any speech frames.

3 Available at: http://htk.eng.cam.ac.uk/

4 See: http://cslu.cse.ogi.edu/corpora/corpCurrent.html


3.1.2.  The  MFCC/GMM  System

Nineteen MFCCs were computed using the MIT-LL GMM system [5] in the
bandwidth of 300-3138 Hz every 10 msec. RASTA filtering was applied to
the MFCCs and deltas were then calculated. Only frames labeled as speech
by the SAD (discussed in Section 3.1.1) were used. The remaining
processing was performed as discussed in Section 3.1.1. In building
target and T-norm models, only the mixture means were adapted from those
of the background model.

3.1.3.  The  LPCC System 

The LPCC system calculated 16 cepstral coefficients (excluding the 0th
cepstral coefficient) from the linear prediction (LP) parameters derived
from smoothed closed-phase analysis as described in [9]. The cepstral
coefficients were computed from the LP parameters using the recursion
outlined in [10]. The features were only calculated for voiced speech
frames, where the voicing was determined using the get_f0 program from
the Entropic Signal Processing System (ESPS). RASTA filtering was
applied, and the feature set included the deltas of the features. The
remaining processing was performed as discussed in Section 3.1.1. In
building target and T-norm models, only the mixture means were adapted
from those of the background model.
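The exact form of the recursion in [10] is not reproduced here. As an
illustration, the following sketch uses the standard LP-to-cepstrum
recursion under the assumed sign convention that the all-pole model is
$H(z) = 1 / (1 - \sum_k a_k z^{-k})$; with the opposite sign convention
for the predictor coefficients, the signs of the inputs change
accordingly. The function name is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients to cepstral coefficients c_1..c_n_ceps.

    a[k-1] holds a_k for the assumed model H(z) = 1 / (1 - sum_k a_k z^-k).
    The 0th cepstral coefficient is not computed.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}, with a_m = 0 for m > p.
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

As a sanity check, a single-pole model with $a_1 = a$ has the closed-form
cepstrum $c_n = a^n / n$, which the recursion reproduces.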

3.1.4.  The  FMBWF0  System

The FMBWF0 system was similar to that of [11]. First, F0 and the
probability of voicing were determined every 10 msec using the ESPS
get_f0 command, which implements the pitch tracking algorithm described
in [12]. Next, the first three formant center frequencies (F1-F3) and
the first three formant bandwidths (B1-B3) were determined from Snack
Version 2.2.2 from KTH.5 Each F0 value was converted to log scale. Each
formant center frequency and bandwidth value was converted to radians.

Frames were retained only if they were (1) declared to be speech by the
SAD; (2) voiced; (3) found to have F0 < 250 Hz; and (4) found to have
(F1, F2, F3) ≠ (500 Hz, 1500 Hz, 2500 Hz). Condition (3) was imposed
because the pitch extractor was found to output pitch-doubled frames at
times, while condition (4) was imposed to eliminate frames where the
formant tracker failed (at which point it would output the default
values of 500, 1500, and 2500).
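The frame selection and unit conversion can be sketched as follows. This
is an illustration under stated assumptions, not the submitted code: the
natural logarithm is assumed for F0, the conversion to radians is assumed
to be $\omega = 2\pi f / f_s$ with a sampling rate $f_s$ of 8000 Hz
(typical of telephone speech), and the function names are hypothetical.

```python
import math

# Default values output by the formant tracker when tracking fails.
DEFAULT_FORMANTS_HZ = (500.0, 1500.0, 2500.0)


def keep_frame(is_speech, is_voiced, f0_hz, formants_hz):
    """Apply conditions (1)-(4) to one 10 msec frame."""
    return bool(
        is_speech
        and is_voiced
        and f0_hz < 250.0
        and tuple(formants_hz) != DEFAULT_FORMANTS_HZ
    )


def to_feature(f0_hz, formants_hz, bandwidths_hz, sample_rate_hz=8000.0):
    """Build the 7-dimensional vector [log F0, F1-F3, B1-B3].

    Assumptions: natural log for F0, and radians = 2*pi*f / sample_rate_hz.
    """
    scale = 2.0 * math.pi / sample_rate_hz
    return (
        [math.log(f0_hz)]
        + [f * scale for f in formants_hz]
        + [b * scale for b in bandwidths_hz]
    )
```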

These  features  were  used  in  the  GMM  system,  and 
T-norm  was  applied  as  discussed  in  Section  3.1.1.  Target 
and  T-norm  models  were  adapted  from  the  background 
model  by  updating  the  weights,  means,  and  variances. 


5  Available  at:  http://www.speech.kth.se/snack 


3.1.5.  The  MFCC/PS-GMM  System

The basic idea of the MFCC/PS-GMM system is to assign each feature
vector a phoneme label, build a GMM for each phoneme for each speaker,
score each labeled feature vector against the proper phoneme-specific
model, and combine the phoneme-specific scores to form a single output
score. The MFCC/PS-GMM system was similar to the system described in
[13] that used phoneme-only adaptation, but with some notable changes.
First, this year's system used MFCCs that were computed as in the
MFCC/GMM system described in Sections 3.1.1 and 3.1.2, including the use
of feature mapping, which was not used in the system of [13]. Second,
each feature vector was associated with a phoneme label as output by the
SONIC speech recognizer (Version 2.0-beta2) from the University of
Colorado at Boulder [14, 15], whereas the system of [13] used phoneme
labels from speech recognition transcripts provided by Stanford Research
Institute for the NIST 2003 Extended Data Task. Thus, the phoneme
alignments were constructed from the state file output by SONIC, and the
feature vectors for a given phoneme were then scored with a GMM built
for that phoneme. Third, in contrast to the system of [13], the GMM for
each phoneme used 2048 mixtures. Finally, only phonemes from the
following set were used: {AE, AH, AX, AY, DH, EH, EY, IH, IY, L, M, N,
OW, S, Y}, in contrast to the larger set used in [13].
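The first step, routing each labeled feature vector to its
phoneme-specific model, can be sketched as a simple grouping operation.
This is a minimal illustration that assumes the phoneme alignment has
already been reduced to one label per feature vector; the function name
and the data layout are hypothetical.

```python
from collections import defaultdict

# The fifteen phonemes for which phoneme-specific GMMs were used.
PHONEMES = {
    "AE", "AH", "AX", "AY", "DH", "EH", "EY", "IH",
    "IY", "L", "M", "N", "OW", "S", "Y",
}


def group_by_phoneme(frames, labels):
    """Group feature vectors by phoneme label, dropping unused phonemes.

    frames: sequence of feature vectors, one per analysis frame.
    labels: sequence of phoneme labels aligned one-to-one with frames.
    """
    groups = defaultdict(list)
    for frame, label in zip(frames, labels):
        if label in PHONEMES:
            groups[label].append(frame)
    return dict(groups)
```

Each group would then be scored against the GMM for that phoneme,
yielding up to fifteen phoneme-specific scores per trial.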

There are some additional points worth noting. First, SONIC has its own
SAD, so the SAD described in Section 3.1.1 was only used if the SONIC
SAD returned no speech frames. Second, the acoustic and trigram language
models used with SONIC were trained using land line data from the
Switchboard database.6 Third, target and T-norm phoneme-specific models
were adapted from the background phoneme-specific models by updating
only the means. Finally, the scores for each phoneme (after the
phoneme-dependent T-norm was applied) were combined with a perceptron
neural net that was trained using the MIT-LL LNKnet package.7 The neural
net used no hidden layers, and the output nonlinearity was a standard
sigmoid. The neural net was trained using data from the NIST 2004
Evaluation.

3.2.  The  WLM  System 

The WLM system is motivated by the original work done by Doddington on
idiolectal differences between speakers [16]. The CMU-Cambridge Language
Modeling Toolkit8 (Version 2.05) formed the basis of this system. The
words from the NIST-supplied transcripts were assembled into pseudo
sentences, where a pause greater than one second between words defined a
sentence break.


6 See: http://www.ldc.upenn.edu

7 Available at: http://www.ll.mit.edu/IST/lnknet

8 Available at: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html


Using  no  sentence  breaks,  where  each  conversation  side 
became  one  sentence,  yielded  worse  performance  than 
using  pseudo  sentence  breaks  when  tested  on  previous 
NIST  evaluations. 
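The pseudo-sentence segmentation can be sketched as follows. The sketch
assumes that each transcript word carries start and end times in seconds
and that the words are in time order; this layout and the function name
are hypothetical and are not a description of the NIST transcript format.

```python
def pseudo_sentences(words, max_pause=1.0):
    """Split a word sequence into pseudo sentences at long pauses.

    words: list of (token, start_time, end_time) tuples in time order.
    A pause longer than max_pause seconds between consecutive words
    starts a new pseudo sentence.
    """
    sentences = []
    current = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None and start - prev_end > max_pause:
            sentences.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences
```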

Bigram language models with back-off were trained with the following
parameters set in the toolkit: top 20,000 words, Witten-Bell
discounting, and zero cut-offs. Target models were trained by
concatenating all the sentences for each of the conversations allowed
for each model, while the background model was built in a similar way,
but with all the sentences from all the files that made up the
background model. The background model data came from Switchboard II.

To compute a score using the WLM system, the sentences from a test file
were tested against a claimant model and the background model. The score
for a given test file and claimant model pair was computed as follows.
Let $B_C$ be the set of bigrams in the claimant model, $C$; $B_B$ be the
set of bigrams in the background model; and $B_T$ be the set of bigrams
in a test file, $T$. Let $B_{TCB} = B_T \cap B_C \cap B_B$, and let
$N_{TCB}$ be the number of bigrams in $B_{TCB}$. Let $P_{b,C}$ be the
probability of bigram $b$ in model $C$ and $P_{b,B}$ be the probability
of bigram $b$ in the background model. The score for $T$ against the
claimant model $C$ was computed as:

$$s(T, C) = \frac{1}{N_{TCB}} \sum_{b \in B_{TCB}} \left[ \log(P_{b,C}) - \log(P_{b,B}) \right].$$

Thus,  unknown  or  non-matching  bigrams  were  ignored. 
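The score is an average log-probability ratio over the bigrams shared by
the test file and both models. A minimal sketch, assuming each model is
available as a dictionary mapping a bigram (a pair of words) to its
probability; this representation, the function name, and the choice to
return None when no bigram is shared are assumptions rather than details
of the toolkit or of the submitted system.

```python
import math


def wlm_score(test_bigrams, claimant_probs, background_probs):
    """Average log-probability ratio over bigrams found in the test file,
    the claimant model, and the background model.

    test_bigrams: iterable of (word1, word2) tuples from the test file.
    claimant_probs, background_probs: dicts mapping bigram -> probability.
    Returns None when no bigram is shared (an assumed convention).
    """
    shared = set(test_bigrams) & set(claimant_probs) & set(background_probs)
    if not shared:
        return None
    total = sum(
        math.log(claimant_probs[b]) - math.log(background_probs[b])
        for b in shared
    )
    return total / len(shared)
```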

One  final  step  was  taken  with  the  inclusion  of  a 
gender-independent  T-norm.  Fifty  male  and  fifty  female 
models  were  built  using  two  conversation  sides  of  data 
from  Switchboard  II  with  transcripts  supplied  by  NIST 
that  were  generated  by  a  BBN  speech  recognizer.9 


4.  System  Combination  and  Thresholds 

For all of the one-speaker detection training and testing conditions,
the component system scores were combined using SLPs built from the 2004
evaluation data using LNKnet. For each training and testing condition,
the test control file (i.e., the list of test file/target model pairs)
from the 2004 evaluation was split into ten disjoint parts. In other
words, there were no test file/target pairs common to two or more parts
(thus, one could concatenate the parts to recover all of the test
file/target model pairs from the original control file). Further, all of
the test files from a given speaker were contained in a single split
control file. For each split control file, a training control file was
constructed from the original control file such that it had no speakers
in common with the split control file either in terms of test files or
in terms of target models.

9 Note that the recognizer used to generate transcripts for
Switchboard II does not appear to be of the same vintage as that used to
generate transcripts for the 2005 Evaluation data.


Using the ten split training files, ten SLPs were built and applied to
the system scores for their respective split control files. The score
combination results for the splits were concatenated, and the thresholds
to be applied for the 2005 evaluation were determined. Then, new SLPs
were built from the entire 2004 control file for each condition to be
applied to the 2005 evaluation, but using the thresholds determined from
the combination of the splits.
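The construction of the split and training control files can be sketched
as below. The sketch reduces each trial to a pair of speaker identities
and assigns test speakers to parts in round-robin order; both choices,
along with the function name, are assumptions made for illustration, since
the assignment rule actually used is not specified above.

```python
def speaker_disjoint_folds(trials, n_folds=10):
    """Return a list of (training_trials, held_out_trials) pairs.

    trials: list of (test_speaker, model_speaker) pairs.
    All trials whose test file comes from a given speaker land in the
    same held-out part; each training list excludes every trial that
    involves a speaker appearing in the held-out part, whether as a
    test speaker or as a target model speaker.
    """
    test_speakers = sorted({test_spk for test_spk, _ in trials})
    # Assumption: round-robin assignment of test speakers to parts.
    fold_of = {spk: i % n_folds for i, spk in enumerate(test_speakers)}
    folds = [[] for _ in range(n_folds)]
    for trial in trials:
        folds[fold_of[trial[0]]].append(trial)

    result = []
    for held_out in folds:
        held_speakers = {spk for trial in held_out for spk in trial}
        training = [
            trial for trial in trials
            if trial[0] not in held_speakers and trial[1] not in held_speakers
        ]
        result.append((training, held_out))
    return result
```

Scoring each held-out part with an SLP trained only on its speaker-disjoint
training list gives combined scores for every 2004 trial, from which the
thresholds can be set without the SLP having seen the speakers it scores.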

5.  Evaluation  Results 

This section presents the performance results of the individual
component systems as well as that of the overall submitted system for
each of the one-speaker detection conditions. The performance is shown
using DET plots, and in some cases, the corresponding system minDCF
values and EERs are given.

5.1.  10sec4w  Testing  Conditions 

Figures 1(a)-(d) show the performance of the component and combined
systems for 10sec4w testing with the training conditions of 10sec4w,
1conv4w, 3conv4w, and 8conv4w, respectively. The minDCF values for the
combined systems for the 10sec4w, 1conv4w, 3conv4w, and 8conv4w training
conditions were 0.0860, 0.0590, 0.0522, and 0.0485, respectively, while
the EERs were 28.20%, 17.02%, 13.30%, and 12.65%, respectively. From the
plots, one can see that the combination of the LPCC, FMBWF0, and
MFCC/GMM systems leads to substantial performance improvement relative
to that of the standard MFCC/GMM system for the 10sec4w training
condition; however, the combination systems do not yield any substantial
performance improvement for the 1conv4w, 3conv4w, and 8conv4w training
conditions. In the 10sec4w training condition, the FMBWF0 and LPCC
systems combine with the MFCC/GMM system to yield a 6.0% relative
improvement in minDCF and a 10.9% relative improvement in EER over those
obtained solely with the MFCC/GMM system (minDCF = 0.0915,
EER = 31.65%). Also, in the 10sec4w training condition, the FMBWF0
system performs almost as well as the MFCC/GMM system.
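The relative improvements quoted for the 10sec4w-10sec4w condition follow
from the reported minDCF and EER values as (baseline - combined) /
baseline. A short check, with a hypothetical helper name:

```python
def relative_improvement(baseline, combined):
    """Relative improvement, in percent, of a lower-is-better metric."""
    return 100.0 * (baseline - combined) / baseline


# 10sec4w training, 10sec4w testing: MFCC/GMM alone versus the combination.
min_dcf_gain = relative_improvement(0.0915, 0.0860)
eer_gain = relative_improvement(31.65, 28.20)
print(round(min_dcf_gain, 1))  # 6.0
print(round(eer_gain, 1))      # 10.9
```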

5.2.  1conv4w  Testing  Conditions

Figure 2(a) shows the performance for the 10sec4w-1conv4w
training/testing condition. Note that for the 10sec4w-1conv4w condition,
the roles of the training and testing files were reversed. Thus, models
were built using the 1conv4w files from the test list, and the frames of
the 10sec4w training files were scored against these models. With this
role reversal, this condition was similar to the 1conv4w-10sec4w
condition shown in Figure 1(b). Figure 2(a) shows that the FMBWF0 and
LPCC systems provide a benefit over the MFCC/GMM system alone, improving
the minDCF from 0.0708 to 0.0675 and improving the EER slightly from
20.75% to 19.93%.
Figure 1: DET plots for the 10sec4w testing conditions showing the
performance of the LPCC, FMBWF0, and MFCC/GMM systems with that of the
combined systems for the (a) 10sec4w, (b) 1conv4w, (c) 3conv4w, and
(d) 8conv4w training conditions.

Figure 2: DET plots for the 1conv4w testing conditions showing the
performance of the LPCC, FMBWF0, MFCC/GMM, MFCC/PS-GMM, and WLM systems
with that of the combined systems for the (a) 10sec4w, (b) 1conv4w,
(c) 3conv4w, and (d) 8conv4w training conditions.


                                      1conv4w           3conv4w           8conv4w
System                             minDCF    EER     minDCF    EER     minDCF    EER
MFCC/GMM                           0.0383   10.33%   0.0293    7.11%   0.0265    7.09%
MFCC/PS-GMM                        0.0381   10.33%   0.0271    6.70%   0.0220    5.75%
C1: Original Combination           0.0371    9.15%   0.0249    6.22%   0.0197    5.66%
C2: Combination Without Sigmoid    0.0335    8.88%   0.0239    6.02%   0.0187    5.30%
Relative Improvement from C1         3.2%   11.4%     15.0%   12.5%     25.7%   20.2%
Relative Improvement from C2        12.5%   14.0%     18.4%   15.3%     29.4%   25.2%

Table 1: EER and minDCF for the MFCC/GMM, the MFCC/PS-GMM, and two
combination systems for 1conv4w testing with 1conv4w, 3conv4w, and
8conv4w training. The first combination system, C1, is the original
combination system discussed in Section 4; the second combination
system, C2, is the combination system with the sigmoid removal discussed
in Section 6. Also shown are the improvements in the performance of the
combination systems, C1 and C2, relative to that of the MFCC/GMM systems
for each condition.


Figures 2(b)-(d) show the performance of the systems for 1conv4w testing
over the training conditions of 1conv4w, 3conv4w, and 8conv4w,
respectively. The MFCC/PS-GMM system outperforms the MFCC/GMM system for
the 8conv4w training condition, but does not significantly outperform it
for the 1conv4w and 3conv4w training conditions. The combination system
outperforms the MFCC/GMM system alone for the 3conv4w and 8conv4w
training conditions and for the 1conv4w training condition with
$P_{FA|NT} > 1\%$. Table 1 shows the EERs and minDCF values for the
MFCC/GMM, MFCC/PS-GMM, and the original combination system, designated
C1, for the 1conv4w, 3conv4w, and 8conv4w training conditions (along
with the performance of a second combination system, designated C2, that
uses a sigmoid removal procedure to be discussed in Section 6). Also
included are the relative improvements in the performance of the
combination systems over that of the MFCC/GMM systems. One can see that
the C1 combination system outperforms the MFCC/GMM system by 11.4-25.7%
in minDCF and EER for these conditions, except for the minDCF for
1conv4w training, which is only improved by 3.2%.

6.  Post-Evaluation  Experiments 

After the evaluation, additional experimentation was conducted in an
effort to improve the utilization of the MFCC/PS-GMM system scores in
the overall combined system. After training the first-stage SLP applied
to the MFCC/PS-GMM scores, the output sigmoid of the SLP was removed.
The combined MFCC/PS-GMM score without the sigmoid was then used along
with the four other system scores to train the second-stage SLP.
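The difference between the two combination schemes can be sketched as
follows. The weights and biases below are placeholders rather than trained
values, and the function names are hypothetical; in the experiments
described here, the second-stage SLP was retrained on the first-stage
output without the sigmoid, so the two schemes would not share
second-stage weights in practice.

```python
import numpy as np


def slp(scores, weights, bias, use_sigmoid=True):
    """Single-layer perceptron: weighted sum, optionally followed by a sigmoid."""
    z = float(np.dot(weights, scores) + bias)
    return 1.0 / (1.0 + np.exp(-z)) if use_sigmoid else z


def two_stage_score(ps_scores, other_scores, stage1, stage2, keep_stage1_sigmoid):
    """Combine phoneme-specific scores, then fuse with the other systems.

    stage1, stage2: (weights, bias) pairs for the two SLPs (placeholders).
    keep_stage1_sigmoid: True for the original scheme (C1); False passes the
    linear first-stage output to the second stage (C2).
    """
    w1, b1 = stage1
    w2, b2 = stage2
    ps_combined = slp(ps_scores, w1, b1, use_sigmoid=keep_stage1_sigmoid)
    fused_input = np.concatenate([[ps_combined], other_scores])
    return slp(fused_input, w2, b2)
```

One plausible reading of the benefit is that the sigmoid compresses large
positive and negative first-stage outputs toward 1 and 0, so removing it
preserves more of the score's dynamic range for the second stage; this
interpretation is offered as a possibility and was not tested here.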

Figure 3 shows the result of removing the sigmoid for 1conv4w and
8conv4w training with 1conv4w testing, while Table 1 shows the
corresponding minDCF values and EERs (as well as the values for the
3conv4w training case) in the row designated as "C2: Combination Without
Sigmoid." Relative to the original C1 score combination method, the
sigmoid removal yields substantial performance improvement in the
1conv4w training condition in the low false alarm region, resulting in a
9.7% relative improvement in minDCF. In contrast, the sigmoid removal
yields only modest additional improvement over that provided by the C1
combination system for the 3conv4w and 8conv4w training conditions.

Figure 3: DET plot comparing the MFCC/GMM and combined system
performance with and without the sigmoid for 1conv4w testing under the
1conv4w and 8conv4w training conditions.

7.  Conclusions 

We have discussed the details and presented the performance of the
AFRL/HEC one-speaker detection systems submitted for the 2005 NIST
Speaker Recognition Evaluation. It was shown that the FMBWF0 and LPCC
systems combined with the MFCC/GMM system to improve the performance
relative to that of the MFCC/GMM system in some of the conditions
involving speech files on the order of 10 sec. The MFCC/PS-GMM system
provided additional performance benefit over that provided solely by the
MFCC/GMM system for conditions involving longer training and testing
files, especially when combined using the two SLPs with the sigmoid
removed from the first SLP after training.